fix(approx_fns): use exact percentile when no compression by aryan-212 · Pull Request #21388 · apache/datafusion

aryan-212 · 2026-04-05T18:43:34Z

Which issue does this PR close?

Closes #.

Rationale for this change

DataFusion's approx_percentile_cont / approx_median use a t-digest internally. The t-digest's interpolation step assumes centroids represent clusters of multiple points. But if the number of input rows is small (≤ the digest's max_size / compression threshold), no compression ever happens: every centroid has weight 1 and corresponds to exactly one input value.

In that regime, interpolation is not just unnecessary, it is actively wrong. The t-digest interpolates between adjacent centroids based on where the rank falls inside the centroid's weight, using half-deltas to neighbors. When every centroid has weight 1, this produces values that drift away from any actual data point.

This is particularly surprising for users running small queries or unit tests, who expect percentile functions on a handful of values to return one of those values.

Concrete Example

Take a small example from the TPC-DS schema:

select cc_sq_ft from call_center;

none	cc_sq_ft
1	6144
2	6144
3	19345
4	21156
5	21156
6	22743
7	34643
8	42935
9	52514
10	65772
11	76815
12	84336
13	105138
14	119886

Now if we take a small APPROX_PERCENTILE query like:

select approx_percentile(cc_sq_ft, 0.85) from call_center limit 50

Here 0.85 * 14 = 11.9, which rounds up to rank 12, so the APPROX_PERCENTILE query above should return the 12th value, 84336. That is what the same query returns in Databricks.

But in DataFusion this comes up as:

This PR aims to fix this.

What was wrong before

Prior to this change, when no t-digest compression occurred, estimate_quantile still ran the t-digest interpolation path. This produced values that were:

- Neither exact continuous percentiles (like percentile_cont)
- Nor exact discrete percentiles (like percentile_approx / Databricks)
- Just t-digest approximation artifacts on already-exact data

For example, approx_median on the 10-value window frame [-85, -72, -56, -48, -43, -25, -12, -5, 45, 83] returned -32, which is neither -34 (the true continuous median) nor -43 (the discrete nearest-rank median).
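To make the gap between these numbers concrete, here is a minimal sketch that contrasts a continuous percentile with a simplified model of half-delta interpolation over weight-1 centroids. Both function names are hypothetical, and `half_delta_estimate` is an illustration of the behavior described above under the assumption that every centroid has weight 1; it is not DataFusion's `estimate_quantile`.

```rust
/// Continuous percentile (percentile_cont-style): linear interpolation
/// between the two order statistics that bracket position q * (n - 1).
fn continuous_percentile(sorted: &[f64], q: f64) -> f64 {
    assert!(!sorted.is_empty() && (0.0..=1.0).contains(&q));
    let pos = q * (sorted.len() - 1) as f64;
    let (lo, hi) = (pos.floor() as usize, pos.ceil() as usize);
    sorted[lo] + (pos - lo as f64) * (sorted[hi] - sorted[lo])
}

/// Simplified model of half-delta interpolation when every centroid has
/// weight 1. Illustrative only; not DataFusion's actual implementation.
fn half_delta_estimate(sorted: &[f64], q: f64) -> f64 {
    assert!(!sorted.is_empty() && (0.0..=1.0).contains(&q));
    let n = sorted.len();
    let rank = q * n as f64;
    // Unit-weight centroid whose interval [pos, pos + 1) contains the rank.
    let pos = (rank.floor() as usize).min(n - 1);
    let prev = sorted[pos.saturating_sub(1)];
    let next = sorted[(pos + 1).min(n - 1)];
    // Half the neighbour-to-neighbour distance; the full gap at either edge.
    let delta = if pos == 0 || pos == n - 1 {
        next - prev
    } else {
        (next - prev) / 2.0
    };
    // Offset of the rank inside the centroid, measured from its centre.
    let offset = rank - pos as f64 - 0.5;
    (sorted[pos] + offset * delta).clamp(prev, next)
}
```

On the 10-value frame with q = 0.5, `continuous_percentile` gives -34.0, while this model gives -32.75, a value that appears nowhere in the data. Truncating -32.75 to an integer would give the -32 quoted above, though that rounding step is an assumption here.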

What changes are included in this PR?

- tdigest.rs: When no compression has occurred (self.count == self.centroids.len()), bypass the t-digest interpolation and use exact_quantile instead. This method uses the nearest-rank (ceiling) method, index = ceil(q * n) - 1, which returns an actual observed data value, matching Databricks' percentile_approx / approx_percentile semantics.
- Test expectation updates: Updated snapshot and SQL logic test expectations across:
  - datafusion/core/tests/dataframe/mod.rs: window_using_aggregates snapshot
  - datafusion/sqllogictest/test_files/aggregate.slt: approx_median, approx_percentile_cont, and approx_percentile_cont_with_weight test expectations
  - datafusion/sqllogictest/test_files/aggregate_skip_partial.slt: approx_median with grouping, nulls, and filters
  - datafusion/sqllogictest/test_files/metadata.slt: approx_median(distinct id) on small table
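The guard and the nearest-rank lookup can be sketched roughly as follows. This is a standalone illustration, not the code in tdigest.rs: the `Centroid`, `Digest`, and `Estimate` types, their fields, and the `new` / `from_values` constructors are assumptions made for the example, and the interpolation branch is left as a placeholder variant.

```rust
/// Illustrative stand-ins; these type and field names are assumptions,
/// not DataFusion's actual definitions.
#[derive(Debug, Clone, Copy)]
struct Centroid {
    mean: f64,
    weight: f64,
}

#[derive(Debug, PartialEq)]
enum Estimate {
    /// An observed input value chosen by nearest rank.
    Exact(f64),
    /// Compressed (or empty) digest: the usual interpolation would run here.
    NeedsInterpolation,
}

struct Digest {
    centroids: Vec<Centroid>, // sorted by mean
    count: f64,               // sum of all centroid weights
}

impl Digest {
    fn new(centroids: Vec<Centroid>) -> Self {
        let count: f64 = centroids.iter().map(|c| c.weight).sum();
        Digest { centroids, count }
    }

    /// Builds an uncompressed digest: one weight-1 centroid per value.
    fn from_values(values: &[f64]) -> Self {
        let mut sorted = values.to_vec();
        sorted.sort_by(|a, b| a.total_cmp(b));
        Self::new(
            sorted
                .iter()
                .map(|&mean| Centroid { mean, weight: 1.0 })
                .collect(),
        )
    }

    /// Nearest-rank (ceiling) method: index = ceil(q * n) - 1.
    /// The caller guarantees a non-empty digest.
    fn exact_quantile(&self, q: f64) -> f64 {
        let n = self.centroids.len();
        // Clamp the 1-based rank so q = 0 maps to the first value.
        let rank = ((q * n as f64).ceil() as usize).clamp(1, n);
        self.centroids[rank - 1].mean
    }

    fn estimate_quantile(&self, q: f64) -> Estimate {
        // No compression: every centroid is exactly one input row.
        if !self.centroids.is_empty() && self.count == self.centroids.len() as f64 {
            Estimate::Exact(self.exact_quantile(q))
        } else {
            Estimate::NeedsInterpolation
        }
    }
}
```

With the 14 cc_sq_ft values from the example, `estimate_quantile(0.85)` takes the exact path and yields 84336; a digest whose centroids carry weights above 1 falls through to the interpolation placeholder.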

Are these changes tested?

Yes. All existing tests have been updated to reflect the new behavior. The key tests are:

- window_using_aggregates: window function with approx_median over varying frame sizes
- aggregate.slt: approx_percentile_cont at various percentiles (0.5, 0.95), including Float16/Float64/decimal types, with and without weights
- aggregate_skip_partial.slt: approx_median with GROUP BY, nullable columns, and FILTER clauses
- metadata.slt: approx_median(distinct id) regression test

Are there any user-facing changes?

Yes. approx_percentile_cont, approx_median, and approx_percentile_cont_with_weight will now return exact nearest-rank values (matching Databricks behavior) when the input dataset is small enough that no t-digest compression occurs (fewer than ~100 values per group by default). For larger datasets where compression happens, the existing t-digest approximation behavior is unchanged.

This means approx_median and percentile_cont(0.5) may now return different values for small datasets — this is expected and consistent with how Databricks distinguishes approximate vs exact percentile semantics.

aryan-212 · 2026-04-07T07:06:39Z

How Databricks treats `percentile` vs `approx_percentile`

Databricks draws a clear semantic difference between its two percentile functions:

| Function | Semantics | Behavior |
| --- | --- | --- |
| `percentile` / `percentile_cont` | Continuous: interpolates between adjacent values | `median([1, 2])` = 1.5 |
| `percentile_approx` / `approx_percentile` | Discrete: returns an actual observed value from the dataset | `approx_median([1, 2])` = 1 |

This was verified by running the equivalent window query on Databricks against the same 21-row dataset used in DataFusion's window_using_aggregates test. The Databricks output confirmed that percentile_approx picks the nearest-rank value (no interpolation), while percentile interpolates.

github-actions bot added the functions Changes to functions implementation label Apr 5, 2026

aryan-212 force-pushed the approx-percentile-fixes branch 2 times, most recently from 22718b8 to e997594 Compare April 6, 2026 06:02

github-actions bot added the core Core DataFusion crate label Apr 6, 2026

aryan-212 force-pushed the approx-percentile-fixes branch 2 times, most recently from 40f862d to 95a4eff Compare April 6, 2026 06:26

github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Apr 6, 2026

aryan-212 force-pushed the approx-percentile-fixes branch 2 times, most recently from d8339ff to 4f86249 Compare April 6, 2026 07:02

aryan-212 force-pushed the approx-percentile-fixes branch from f0ef5fd to 48e384b Compare April 7, 2026 08:19

aryan-212 added 3 commits April 7, 2026 13:49

fix(approx_fns): use exact percentile when no compression

fa95008

fix test

c0421a1

tests: more test corrections

57fae2e

aryan-212 force-pushed the approx-percentile-fixes branch from 48e384b to 57fae2e Compare April 7, 2026 08:19

nimalan-e6x approved these changes Apr 7, 2026

aryan-212 wants to merge 3 commits into apache:main from aryan-212:approx-percentile-fixes
